[WIP] New Zig formal grammar #1685

Merged
merged 25 commits into from
Nov 13, 2018

Hejsil
Contributor

@Hejsil Hejsil commented Oct 27, 2018

This is an attempt at formalizing a Parsing Expression Grammar for the Zig programming language. This is done to find a better solution for #760.

Currently, I have the grammar posted here using the peg parser generator to validate it. The grammar is a breaking change from 0.3.0 (See what changed in db5d479).

The grammar implements #1047 and some of #114.

@Hejsil added the "work in progress" label Oct 27, 2018
@andrewrk
Member

Thanks for doing this work.

(See what changed in 0de46ff).

try (switch (c) {

This seems worse. Why is this necessary?

         if (base.id == (comptime typeToId(T))) { 

Same question

     link_err: errorset{OutOfMemory}!void, 

Can we keep error as the keyword that makes error sets, and use anyerror as the new global error set primitive type?

return HashInt(unsigned_x) ^ (comptime rng.scalar(HashInt));

Would this work? return HashInt(unsigned_x) ^ comptime(rng.scalar(HashInt));

pub const LPOVERLAPPED_COMPLETION_ROUTINE = ?(extern fn (DWORD, DWORD, *OVERLAPPED) void);

This seems worse. Why is this needed?

    assert(1234 == (switch (x) {
        MultipleChoice.A => 1,
        MultipleChoice.B => 2,
        MultipleChoice.C => u32(1234),
        MultipleChoice.D => 4,
    })); 

Same question.

    if (t or (x: {
        assert(f);
        break :x f;
    })) {

Same question.

The 1 token lookahead goal might not be an interesting goal to reach, but we are really close

I agree that 1 token lookahead is not an important goal to reach. I will happily trade a couple token lookaheads for any other syntactic gain.

It would also be nice if we had the bison parser compiled on every commit, since bison can detect ambiguities through conflicts. Using the parser on all .zig files would also be a good idea, so that we ensure that stage1 and stage2 conform to the spec.

Let's keep the dependencies of Zig at a minimum, that is, a system c++ compiler, libLLVM, and libclang. However I would be open to a separate repository dedicated to testing Zig grammar, which we could have the CI use to run on every commit.

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

The reason switch and comptime require parens is that I gave them a high precedence, together with the other control flow expressions.

For comptime, it was done because of this mess:

// A lower precedence `comptime` Expr rule would cause this ambiguity:
async<comptime A> fn()void
//async<(comptime A> fn()void)
//async<(comptime A)> fn()void

I think we could also solve this with comptime (expr).

For switch and blocks, this was done to correctly formalise the rule that these statements shouldn't have semicolons behind them:

// Prev rules:
// Statement
//    : SwitchExpr
//    | Expr Semicolon
//    ...
// PrimaryExpr
//    : SwitchExpr
//    ...
// All these are valid, with the former grammar
switch (a) {}
switch (a) {};
{}
{};

// It would also cause this ambiguity
{}{}; // Is this two blocks, or a block followed by an initializer
switch (a) {}{}; // Same with switch

The requirement of parens around fn types is to resolve this:

fn()fn()void!void
//fn()(fn()void!void) // This is how it is parsed with the new grammar
//fn()(fn()void)!void

// This solution requires this grammar
// FnTypeExpr
//     : ErrorUnionExpr
//     | FnTypePrefix ErrorUnionExpr // Just noticed, that this ErrorUnionExpr should be a FnTypeExpr
// 
// ErrorUnionExpr
//     : PrefixExpr
//     | PrefixExpr ExclamationMark PrefixExpr
// 
// PrefixExpr
//     : SuffixExpr
//     | PrefixOp PrefixExpr // To have []fn()void, parens is required

I'll look more into relaxing these paren requirements. If you have any ideas, I'm all ears :)

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

I like the separate repo idea btw. Where do we keep the grammar? In the Zig or Zig-grammar repo?

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

Can we keep error as the keyword that makes error sets, and use anyerror as the new global error set primitive type?

We could, but what about error.SomeName?

@andrewrk
Member

Where do we keep the grammar?

In the separate repo I think. ziglang/zig is an implementation of the zig specification (which isn't written yet; see #75) using recursive descent, and the grammar repo would be a tool used for validating and testing the formal grammar specification.

We could, but what about error.SomeName?

We could make that continue to work with special syntax, yes? It's always been syntactic sugar for error{SomeName}.SomeName.

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

We could make that continue to work with special syntax, yes? It's always been syntactic sugar for error{SomeName}.SomeName.

Right, we could do that. Was trying to keep the Expr Dot Symbol to also handle this case, but I guess there is really no need for that. We can just have this PrimaryExpr: Keyword_error Dot Identifier | ...

@winksaville
Contributor

winksaville commented Oct 27, 2018

Since we're formalizing the grammar, I'd like to suggest allowing separators in numeric literals to improve readability.

I did some research: C++14 uses a single quote ', and other languages use an underscore _: Rust, Java, Swift, Ruby, Perl, Python and maybe others.

Of course there is at least one language where it was discussed and rejected, Go here and here.

@ghost

ghost commented Oct 27, 2018

async<comptime A> fn()void

'>' acts exactly like '{' in this case. In some places an expression cannot contain '{', because there it always starts the function body. Here '>' always closes async. In both cases you can allow the use of parentheses to have the code parsed differently. You could always unify these simpler expressions, disallowing both '{' and '>' in both cases.

The same could be said about '[' and ']'. When inside '[' the next ']' does not apply to the current expression but to the parent. Ignoring the fact that Zig does not have a ']' operator, but that's beside the point.

// A ']' operator would work fine
const a = b ] c;
const z = b[(c]d)];

async<comptime A> fn()void also works fine.

Maybe comptime(expr) would also work fine as was suggested. But I think the strategy above might be something to keep in mind when these issues pop up.

@ghost

ghost commented Oct 27, 2018

From the compiler writer's POV, I think it's really just about keeping track of what token stops the current expression and returns to the parent. Sometimes it's ']', sometimes it's ',', sometimes it's '{', sometimes it's '>'.

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

There are still a few ambiguities that I'm not sure how to fix:

// async<A> (fn()void) ()
// (async<A> fn()void) ()
_ = async<A> fn()void ();

'>' acts exactly like '{' in this case. In some places an expression cannot contain '{', because there it always starts the function body. Here '>' always closes async. In both cases you can allow the use of parentheses to have the code parsed differently. You could always unify these simpler expressions, disallowing both '{' and '>' in both cases.

This is not a problem with <>. The problem still exists with this example:

// async (fn()void) ()
// (async fn()void) ()
_ = async fn()void ();

The grammar is ambiguous between an async call and an async function type.

@ghost

ghost commented Oct 27, 2018

Yes, my bad. I did edit my answer.

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

@winksaville #504

@ghost

ghost commented Oct 27, 2018

I think this is a nice attempt at making the grammar context free but now that the grammar is simpler for machines it needs to be refined for humans. These things stick out to me as being very awkward:

// unintuitive parentheses around if expr
const A = B + (if (rem == 0) 0 else (os.page_size - rem));

// unintuitive parentheses around return expr, now no parentheses around if expr
const A = if (B >= T.bit_count) (return 0) else @intCast(Log2Int(T), abs_shift_amt);

// no parentheses around second return stmt, parentheses around first
if (b) |b_p| (return eql(a_p, b_p)) else |_| return false;

// parentheses around if stmt
pub const ChildProcess = struct {
    pub pid: (if (is_windows) void else i32),
}

I shortened some of the identifier names.

@Hejsil
Contributor Author

Hejsil commented Oct 27, 2018

@UniqueID1
I agree that the parens around the return expressions are bad, though I think some enforcement of parens around if is not too bad of an idea (maybe not when it's alone, but for some expressions, it would definitely improve readability).

@wirelyre

@UniqueID1 A language is a set of strings. In this case, the set of all syntactically legal Zig programs. A grammar is a structured way of representing a language.

One property of a language (but not a grammar) is whether it is context free. If you can write a grammar for Bison, then the language it parses is context free.

BNF grammars define languages unambiguously, but not parse trees. Bison emits a warning when there are two different ways to interpret the same input. This does not affect which inputs are accepted, but rather why they are accepted (that is, not the question "is this a legal program?", but rather "what is the structure of this program?").

It is not really enough to write a grammar and then let the resolution of ambiguous parses depend on a hand-written parser. Then you have to understand the parser program; and additionally, if the parser program does not correctly implement the language defined by the grammar, you're sunk.

@ghost

ghost commented Oct 28, 2018

@wirelyre Thanks. I deleted my comment.

@Hejsil
Contributor Author

Hejsil commented Oct 28, 2018

Alright, here are the two ways we can choose for comptime and return to work together with the "block exprs":

Expr: ControlFlowExpr

ControlFlowExpr
    : if lparen Expr rparen ControlFlowExpr
    | if lparen Expr rparen ReturnExpr else ControlFlowExpr
    | ReturnExpr

ReturnExpr
    : return PrimaryExpr
    | PrimaryExpr

PrimaryExpr
    : lparen Expr rparen
    | num

// Valid code
// if (1) return 1 else return 1
// return (if (1) 1 else 1)

or

Expr: ReturnExpr

ReturnExpr
    : return ControlFlowExpr
    | ControlFlowExpr

ControlFlowExpr
    : if lparen Expr rparen ControlFlowExpr
    | if lparen Expr rparen PrimaryExpr else ControlFlowExpr
    | PrimaryExpr

PrimaryExpr
    : lparen Expr rparen
    | num

// Valid code
// if (1) (return 1) else (return 1)
// return if (1) 1 else 1

The simple grammar can't have both, because it is ambiguous:

Expr: ControlFlowExpr

ControlFlowExpr
    : if lparen Expr rparen ControlFlowExpr
    | if lparen Expr rparen ReturnExpr else ControlFlowExpr
    | ReturnExpr

ReturnExpr
    : return ControlFlowExpr
    | PrimaryExpr

PrimaryExpr
    : lparen Expr rparen
    | num

Derivation 1:

  0: Expr
  1: ControlFlowExpr
  2: if lparen Expr rparen ReturnExpr else ControlFlowExpr
  3: if lparen Expr rparen return ControlFlowExpr else ControlFlowExpr
  4: if lparen Expr rparen return if lparen Expr rparen ControlFlowExpr else ControlFlowExpr
  5: if lparen Expr rparen return if lparen Expr rparen ReturnExpr else ControlFlowExpr

Derivation 2:

  0: Expr
  1: ControlFlowExpr
  2: if lparen Expr rparen ControlFlowExpr
  3: if lparen Expr rparen ReturnExpr
  4: if lparen Expr rparen return ControlFlowExpr
  5: if lparen Expr rparen return if lparen Expr rparen ReturnExpr else ControlFlowExpr
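The two derivations can also be checked mechanically. Here is a minimal Python sketch (not part of the PR; the `GRAMMAR` table and the `count_parses` helper are hypothetical names introduced for illustration) that transcribes the ambiguous grammar and counts distinct parse trees by brute force. It assumes `if`, `else`, `return`, `lparen`, `rparen` and `num` are the only terminals, and it substitutes `num` for every remaining `Expr`.

```python
from functools import lru_cache

# Transcription of the ambiguous grammar; any name that is not a key
# of this table is treated as a terminal token.
GRAMMAR = {
    "Expr": [["ControlFlowExpr"]],
    "ControlFlowExpr": [
        ["if", "lparen", "Expr", "rparen", "ControlFlowExpr"],
        ["if", "lparen", "Expr", "rparen", "ReturnExpr", "else", "ControlFlowExpr"],
        ["ReturnExpr"],
    ],
    "ReturnExpr": [["return", "ControlFlowExpr"], ["PrimaryExpr"]],
    "PrimaryExpr": [["lparen", "Expr", "rparen"], ["num"]],
}


def count_parses(tokens, start="Expr"):
    """Count the distinct parse trees that derive exactly `tokens`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def sym(symbol, i, j):
        # Number of ways `symbol` derives tokens[i:j].
        if symbol not in GRAMMAR:
            return 1 if j == i + 1 and tokens[i] == symbol else 0
        return sum(seq(tuple(alt), i, j) for alt in GRAMMAR[symbol])

    @lru_cache(maxsize=None)
    def seq(symbols, i, j):
        # Number of ways the symbol sequence derives tokens[i:j].
        if not symbols:
            return 1 if i == j else 0
        head, rest = symbols[0], symbols[1:]
        total = 0
        # Every symbol consumes at least one token (no empty rules).
        for k in range(i + 1, j - len(rest) + 1):
            left = sym(head, i, k)
            if left:
                total += left * seq(rest, k, j)
        return total

    return sym(start, 0, len(tokens))


if __name__ == "__main__":
    src = "if lparen num rparen return if lparen num rparen num else num"
    print(count_parses(src.split()))  # 2: the else can bind to either if
```

The input is step 5 of both derivations with `num` filled in, so a count of 2 corresponds to Derivation 1 and Derivation 2. The counter only terminates because this grammar has no left recursion and no empty rules; it is a checking aid, not a parser design.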

We can special-case certain things in the body of the if, to allow for return of simple expressions, but the more we special-case, the harder it will be to explain the association and precedence of operators (if is basically a prefix operator).

@Hejsil
Contributor Author

Hejsil commented Oct 29, 2018

I've thought long and hard about this, and I don't think we can make this grammar be context-free and unambiguous without sacrificing some of the niceness of the current syntactic constructs, or resorting to a special priority system outside the grammar (which kinda defeats the point).

I propose that we instead create the grammar as a parsing expression grammar (PEG). The pro is that the grammar cannot be ambiguous, as only the first matching rule will be considered. The con is that PEGs hide syntax flaws.
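To make the trade-off concrete, here is a minimal Python sketch of PEG ordered choice (the helpers `literal`, `choice` and `matches` are hypothetical names, not taken from the peg tool or from the PR). It shows both sides: a choice never yields two parses, but an alternative listed after a shorter prefix of itself silently becomes unreachable.

```python
def literal(text):
    """Parser that matches `text` at `pos` and returns the end offset, else None."""
    def parse(s, pos):
        return pos + len(text) if s.startswith(text, pos) else None
    return parse


def choice(*alternatives):
    """PEG ordered choice: commit to the first alternative that matches."""
    def parse(s, pos):
        for alt in alternatives:
            end = alt(s, pos)
            if end is not None:
                return end  # later alternatives are never consulted
        return None
    return parse


def matches(rule, s):
    """True if `rule` consumes the whole input string."""
    return rule(s, 0) == len(s)


# "ifelse" is a made-up token used only to show the ordering effect.
short_first = choice(literal("if"), literal("ifelse"))
long_first = choice(literal("ifelse"), literal("if"))

if __name__ == "__main__":
    print(matches(short_first, "ifelse"))  # False: "if" wins, "else" is left over
    print(matches(long_first, "ifelse"))   # True
```

A context-free grammar tool would report the overlap between the two alternatives as a conflict; with ordered choice the grammar is accepted as written and the second alternative of `short_first` simply never fires.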

Here is how I imagine all the current ambiguities will be parsed if we have a PEG:


/  FnDef   \
fn a() !b {}
       ^|
ErrorInf|
     TypeExpr

/          IfExpr          \
          /     IfExpr     \
if (true) if (true) 1 else 1

/   Async call    \
         /FnType\
async<A> fn()void()

/    Call    \
/Async call\
async<A> a()()

/   FnType    \
    /  FnType \
        /ErrUn\
fn()fn()void!u8

/         FnType         \
      /TypeExpr\
async<comptime A> fn()void;

// This one will be hard to parse correctly, since, with a recursive descent parser
// we will start by parsing it as:
      /    TypeExpr      \
async<comptime A> fn()void;

// The parser should then revert its current parsing, and parse it correctly...
// I think we'll just parse this wrong until we have
// https://github.com/ziglang/zig/issues/1639

Also, we can't have the semicolon rule with PEG if we wanna keep the block expression having high priority:

if (true) {} // Valid
if (true) {}; // Valid, should be an error, but we can't have this without disallowing if {} in if expressions.
if (true) A;  // Valid
if (true) A  // error expected ;

@thejoshwolfe
Contributor

It seems like getting rid of <> as grouping operators would solve some problems. C++ has some nasty syntax due to > being both an infix operator and a group closing operator. Zig has the same problem with async.

I know the semantics of async are planned to change in the near future. Perhaps those changes will yield different syntax. I've always been uneasy with <> as grouping operators, so avoiding them entirely seems promising.

@andrewrk
Member

I've always been uneasy with <> as grouping operators, so avoiding them entirely seems promising.

In #661 we need a grouping operator to pass a calling convention expression to the fn keyword. Do you have a syntax suggestion for this?

@andrewrk
Member

I've thought long and hard about this, and I don't think we can make this grammar be context-free and unambiguous without sacrificing some of the niceness of the current syntactic constructs, or resorting to a special priority system outside the grammar (which kinda defeats the point).

I propose that we instead create the grammar as a parsing expression grammar (PEG). The pro is that the grammar cannot be ambiguous, as only the first matching rule will be considered. The con is that PEGs hide syntax flaws.

I think this is the way to move forward. 👍 from me. This is always how I imagined the grammar working. I believe that I have incorrectly been using the term "context-free grammar" when I only meant "you can make a parse tree without doing any semantic analysis".

@Hejsil
Contributor Author

Hejsil commented Oct 30, 2018

The prefix async to async calls is really never gonna work.

Option 1:
SuffixExpr
    <- AsyncPrefix SuffixExpr FnCallArguments
     / PrimaryExpr SuffixOp*

const a = async t.t();
AsyncPrefix: "async" Ok
SuffixExpr: "t.t()" Ok
FnCallArguments: ";" failed
PrimaryExpr: "async" failed

Option 2:
SuffixExpr <- AsyncCallExpr SuffixOp*
AsyncCallExpr
    <- AsyncPrefix PrimaryExpr FnCallArguments
     / PrimaryExpr

const a = async t.t();
AsyncPrefix: "async" Ok
PrimaryExpr: "t" Ok
FnCallArguments: "." failed
PrimaryExpr: "async" failed

As we can see, both of these examples cannot be parsed without the parser having to be context aware.
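The Option 1 trace can be reproduced with a small hand-written PEG-style matcher. This is a Python sketch under simplifying assumptions, not the stage 1 or stage 2 parser: `PrimaryExpr` is reduced to a single identifier, `FnCallArguments` to an empty `()`, `SuffixOp` to `.name` or `()`, and all function names are hypothetical.

```python
KEYWORDS = {"async"}


def tok(tokens, pos):
    """Token at `pos`, or None past the end of input."""
    return tokens[pos] if pos < len(tokens) else None


def parse_primary(tokens, pos):
    # PrimaryExpr: a single non-keyword identifier (simplification).
    t = tok(tokens, pos)
    if t is not None and t.isidentifier() and t not in KEYWORDS:
        return pos + 1
    return None


def parse_call_args(tokens, pos):
    # FnCallArguments: only the empty argument list "()" (simplification).
    if tok(tokens, pos) == "(" and tok(tokens, pos + 1) == ")":
        return pos + 2
    return None


def parse_suffix_op(tokens, pos):
    # SuffixOp: field access ".name" or a call "()".
    if tok(tokens, pos) == "." and parse_primary(tokens, pos + 1) is not None:
        return pos + 2
    return parse_call_args(tokens, pos)


def parse_suffix_expr(tokens, pos):
    """Return the end position of a SuffixExpr starting at `pos`, or None."""
    # Alternative 1: AsyncPrefix SuffixExpr FnCallArguments
    if tok(tokens, pos) == "async":
        inner = parse_suffix_expr(tokens, pos + 1)
        if inner is not None:
            end = parse_call_args(tokens, inner)
            if end is not None:
                return end
    # Alternative 2: PrimaryExpr SuffixOp* (greedy repetition)
    end = parse_primary(tokens, pos)
    while end is not None:
        nxt = parse_suffix_op(tokens, end)
        if nxt is None:
            return end
        end = nxt
    return None


if __name__ == "__main__":
    print(parse_suffix_expr(["async", "t", ".", "t", "(", ")", ";"], 0))  # None
    print(parse_suffix_expr(["t", ".", "t", "(", ")", ";"], 0))           # 5
```

The inner `SuffixExpr` greedily consumes `t.t()` including the call, so the outer `FnCallArguments` sees `;` and fails, and the second alternative cannot start at `async`. A PEG repetition never gives back what it consumed, which is why the rule as written cannot express "the last suffix must be a call".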

@andrewrk
Member

Can you elaborate a little? I'm focused on this copy elision stuff and so I'm not quite grokking your examples. In the current parser we look for a function call expression directly after async and it seems to work. Is there a problem with it?

@Hejsil
Contributor Author

Hejsil commented Oct 30, 2018

@andrewrk What we do in both parsers, is look at what type the resulting expression of SuffixExpr is. If it's a call, then we add the async info to this node. This is parsing based on context.

@Hejsil
Contributor Author

Hejsil commented Oct 30, 2018

What I show in my examples is how the PEG grammar would parse the example: it fails, because we cannot express this check in the grammar.

@Hejsil
Contributor Author

Hejsil commented Nov 12, 2018

I have all stage 1 tests passing locally. As I mentioned, we still have these being parsed incorrectly:

fn a() if (true) A {}
fn a() while (true) A {}
fn a() for ([]void{}) A {}
fn a() comptime A {}
fn a() break A {}
fn a() cancel A {}
fn a() resume A {}
fn a() return A {}

I can trivially make the break, cancel, resume and return cases work, as they can never return a type, so they can get a precedence above TypeExpr.

As for comptime, if, for, while, I think the plan is to have two syntaxes: one for an if type expression and one for a normal if expression.

Also, in the grammar, if (true) {}; is a valid statement, but in my implemented parser, it is not. This is because of PEG's unlimited lookahead, whereas in my implementation, each rule only has N lookahead, where N doesn't scale with input.

For now, I'm gonna do the least amount of effort to make stage 2 able to parse the new anyerror syntax, and then leave it be. It's gonna get rewritten at some point anyway, so it can implement the new grammar when that happens.

@Hejsil
Contributor Author

Hejsil commented Nov 13, 2018

Also, in the grammar, if (true) {}; is a valid statement, but in my implemented parser, it is not. This is because of PEG's unlimited lookahead, whereas in my implementation, each rule only has N lookahead, where N doesn't scale with input.

This is not correct. I just found out that it's not valid and my grammar does the semicolon rule correctly (from the tests I've done). Hooray!

@Hejsil
Contributor Author

Hejsil commented Nov 13, 2018

I have all stage 1 tests passing locally. As I mentioned, we still have these being parsed incorrectly:

fn a() if (true) A {}
fn a() while (true) A {}
fn a() for ([]void{}) A {}
fn a() comptime A {}
fn a() break A {}
fn a() cancel A {}
fn a() resume A {}
fn a() return A {}

I can trivially make the break, cancel, resume and return cases work, as they can never return a type, so they can get a precedence above TypeExpr.

As for comptime, if, for, while, I think the plan is to have two syntaxes: one for an if type expression and one for a normal if expression.

Implemented!

@Hejsil
Contributor Author

Hejsil commented Nov 13, 2018

I have all tests passing locally. This is ready to be merged once CI finishes running.

If anyone wishes for a zig fmt pass that can convert 0.3.0 code -> new anyerror syntax or one that reverts the struct.{} syntax, I can make one, but I don't really want to if no one is gonna use it :)

I'll make a repo for the grammar soon, and find a way to have it be built into the docs.

@Hejsil removed the "work in progress" label Nov 13, 2018
@wqweto

wqweto commented Nov 13, 2018

Can confirm that I can parse all .zig sources in this PR (incl all tests) with parser.y grammar as input to my (independent) vbpeg parser generator which completely understands Ian's peg/leg dialect for PEGs.

If time permits will annotate my fork of the grammar (in vbpeg tests) with native language actions to produce some kind of parse tree in JSON format for reference.

@Hejsil
Contributor Author

Hejsil commented Nov 13, 2018

I ended up making the fmt passes, as I needed them myself:

@gernest

gernest commented Nov 13, 2018

@Hejsil can you please help me understand how I can use zig-fmt-error-to-anyerror and zig-fmt-revert-dot-init-container-decl.

This change breaks a lot of stuff I have been working on; an automatic conversion will be helpful.
Thanks.

@Hejsil
Contributor Author

Hejsil commented Nov 13, 2018

You will need to build the stage 2 compiler from these branches from source. The guide for that is in the README.

@gernest

gernest commented Nov 13, 2018

thanks


8 participants